Introduction

CCS1477 (Figure 1) is a small-molecule inhibitor of p300/CBP conserved bromodomain which could act as a therapeutic agent for treating lethal prostate cancer (de Bono et al., 2021)[1]. In the previous assignments,RNASeq data were obtained from GEO (GSE162564), duplicate gene IDs were removed, low counts were filtered out, and normalization was performed using a TMM (Trimmed Mean of M-values) method. The resulting dataset contained 14,462 rows with HUGO symbols as rownames, and 6 columns: 3 samples treated with CCS1477, and 3 controls (Figure 2, 3). Then, a heatmap was constructed based on the lists of significantly upregulated and downregulated genes as a result of CCS147 treatment in prostate cancer model cells (Figure 4, 5). Thresholded over-representation analysis was perform to obtain GO enrichment using g:Profiler (Raudvere et al., 2019)[2]. The GO terms associated with these genes showed that androgen response is downregulated with CCS147 treatment.

Figure 1. Chemical structure of CCS1477. Adpoted from de Bono et al., 2021.

Figure 2. Normalized RNAseq data showing the distribution for each treatment.

Figure 3. MDP plot shows clear separation between the CCS1477 samples and controls.

Figure 4. MA plot shows the log-fold change in expression for CCS1477 treated samples. Using edgeR::glmQLFTest, 1,212 genes were returned with FDR < 0.05.

Figure 5. Heatmap showing clear distinction between the expression patterns in treatment and control samples.

In this study, we will perform a non-thresholded Gene Set Enrichment Analysis (GSEA) to compare the results with the thresholded method performed previously. Then we will use Cytoscape to visualize the data to get a better understanding of how these pathways are related to each other.

Procedures & Results

Non-thresholded Gene Set Enrichment Analysis

First, load the data from the previous assignment:

# load the data from A2:
qlf_output_hits <- read.table(file=file.path(getwd(), 
                "qlf_output_hits.txt"),
                                    header = TRUE,sep = "\t",
                                    stringsAsFactors = FALSE,
                                    check.names=FALSE)

dim(qlf_output_hits)
## [1] 14461     6
knitr::kable(qlf_output_hits[1:10,1:5], type="html")
logFC logCPM F PValue FDR
STEAP2 1.646674 7.853514 1946.675 0 0
KLK2 2.914011 5.180125 1431.817 0 0
KCNMA1 2.488220 5.144159 1354.614 0 0
NKX3-1 1.388320 9.165626 1349.821 0 0
PMEPA1 1.566025 9.076398 1276.812 0 0
KLK4 2.146761 5.718396 1234.041 0 0
HMGCS2 2.179001 6.560366 1204.200 0 0
ASB9 2.279105 5.616970 1169.287 0 0
COLCA1 1.805184 5.994805 1150.890 0 0
TRGC1 1.903587 5.642413 1148.789 0 0

Create a ranked list:

qlf_output_hits[,"rank"]<-
log(qlf_output_hits$PValue,base =10)*sign(qlf_output_hits$logFC)
#sort table by ranks
qlf_output_hits <-qlf_output_hits[order(qlf_output_hits$rank),]
knitr::kable(qlf_output_hits[1:10,1:6], type="html")
logFC logCPM F PValue FDR rank
STEAP2 1.646674 7.853514 1946.675 0 0 -19.94755
KLK2 2.914011 5.180125 1431.817 0 0 -18.68846
KCNMA1 2.488220 5.144159 1354.614 0 0 -18.46174
NKX3-1 1.388320 9.165626 1349.821 0 0 -18.44725
PMEPA1 1.566025 9.076398 1276.812 0 0 -18.21999
KLK4 2.146761 5.718396 1234.041 0 0 -18.08081
HMGCS2 2.179001 6.560366 1204.200 0 0 -17.98087
ASB9 2.279105 5.616970 1169.287 0 0 -17.86080
COLCA1 1.805184 5.994805 1150.890 0 0 -17.79610
TRGC1 1.903587 5.642413 1148.789 0 0 -17.78864
# simplify the table
ranked_qlf_output_hits <- qlf_output_hits[,6]
ranked_qlf_output_hits <- cbind(rownames(qlf_output_hits), ranked_qlf_output_hits)
colnames(ranked_qlf_output_hits) <- c("GeneName", "rank")

knitr::kable(ranked_qlf_output_hits[1:10,], type="html")
GeneName rank
STEAP2 -19.9475548112424
KLK2 -18.6884606317337
KCNMA1 -18.4617439766545
NKX3-1 -18.447251176766
PMEPA1 -18.2199854817399
KLK4 -18.0808129309239
HMGCS2 -17.9808724723019
ASB9 -17.8607980301764
COLCA1 -17.7960977596078
TRGC1 -17.7886424450779

Export the list:

write.table(ranked_qlf_output_hits,file = file.path("ranked_qlf_output_hits.rnk.txt"),sep = "\t",row.names = FALSE,col.names = TRUE, quote = FALSE)

Using the GSEA software (v. 4.3.2) based on a joint project of UC San Diego and Broad Institute[4], GSEA was performd using the ranked list. The gene set (Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt) was obtained from Gary Bader lab at University of Toronto[5] with GO biological processes, all pathways and no IEA from March 2023 release.

Figure 6. GSEA settings with custom gene set sizes

Maximum geneset size was set at 200, and minimum geneset size at 15. If the gene set size is too small, the statistical power of the test may be low, making it difficult to detect true enrichment signals. Conversely, if the gene set size is too large, the test may become overly sensitive and detect false positive signals due to chance (multiple testing problem).

Permutation is required to ensure that the set of genes being tested is randomly distributed throughout the genome. By randomly shuffling the gene labels in the dataset, the gene set permutation creates a large number of random gene sets, which can be used to generate a null distribution of enrichment scores. The enrichment score of the observed gene set is then compared to this null distribution. This helps reduce the likelihood of obtaining false positives.

The GO terms obtained (Figures 7, 8) were almost identical to what we obtained using the thresholded method (Figures 9), where upregulated hits include pathways related to mitosis (chromosome segregation, spindle formation etc.) and downregulated hits include a wide range of pathways for amino acid synthesis and translation.

Figure 7. Upregulated GO terms obtained with a non-thresholded method using GSEA software.

Figure 8. Downregulated GO terms obtained with a non-thresholded method using GSEA software.

Figure 9. GO terms obtained previously with a thresholded method using g:Profiler. A = upregulated genes. B = downregulated genes.

Although we are simply comparing the lists of GO terms obtained by two different methods, it is difficult to “visually” see how these processes are related, and comparing tables is not intuitive. A more straightforward way to qualitatively compare the results is needed.

Visualizing GSEA in Cytoscape

In order to gain a more visual representation of the data, an enrichment map was created in Cytoscape (v. 3.9.1) using the results from the non-thresholded analysis.

Figure 10.

Figure 10.

FULL DISCLOSURE

I’m CR/NCR-ing this course. I ran out of time. I need to work on my thesis instead. I must end this report here. Thank you for your time!

Interpretation

Conclusion & Outlook

Reference

1. Bono JS et al de: Targeting p300/CBP in lethal prostate cancer. Cancer Discov 2021, 11:1118–1137.
2. Uku Raudvere IK Liis Kolberg: G:profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research 2019.
3. Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: A r t i c l e s PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. NATURE GENETICS VOLUME 2003, 34.
4. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. 2005.
5. Bader G: Index of /EM_genesets. 2023.